Date: 2025-02-26
Title: A non-linear statistical framework to investigate changes in life history patterns within and among fish populations
Authors: Ng Z. W. Clement#; Reis-Santos, Patrick; Gonzalez, Julio G.; Gillanders, Bronwyn M.; Saleh, Muhammad F.; Ong, J. L. Joyce
DOI:
#Corresponding Author: Asian School of the Environment, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798. Email: NG0002NT@e.ntu.edu.sg
This RMarkdown file is a basic step-by-step guide for analysing otolith growth and chemistry. Please refer to Eric J. Pedersen and co-authors (2019), Hierarchical generalized additive models in ecology: an introduction with mgcv, for a detailed explanation on the GAMM syntax.
Please refer to Simon Wood’s (2017) book, Generalized Additive Models An Introduction with R, Second Edition, for an explanation on the theory behind Generalised Additive Models and and smoothing terms.
Other useful references include Simpson, (2018), Modelling Palaeoecological Time Series Using Generalised Additive Models, and Zuur et al., (2009), Mixed Effects Models and Extensions in Ecology with R.
Certain sections of the R code was adapted from the Introduction to GAM and GAMM course provided by Dr Alain Zuur and Dr Elena Ieno from Highland Statistics Ltd.
The following code sets the working directory to where the .Rmd file is saved.
Note: Please save the .csv example files in the same folder as the .Rmd framework guide.
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
The mgcv package provides tools for constructing Generalised Additive Mixed Models (GAMMs), the DHARMa package produces residual diagnostics for hierarchical (multi-level/mixed) regression models, the itsadug package provides functions to visualise smooth terms from the additive models, the bbmle package allows us to perform model selection using Akaike’s Information Criterion, and the zoo package provides tools for linear interpolation and rolling median/means. Please refer to the respective CRAN packages for more information on the functions.
All analyses were performed in R version 4.4.1.
## Data preparation
library(plyr)
library(tidyverse)
library(data.table)
# Checking model assumptions
library(zoo)
library(naniar)
library(DHARMa)
# Additive mixed models and model selection
library(mgcv)
library(AICcmodavg)
library(bbmle)
library(marginaleffects)
library(itsadug)
# R Markdown
library(knitr)
The Mypairs function calculates the pairwise correlation
in a given data frame and creates a scatter plot. R code from Dr Alain
Zuur and Dr Elena Ieno (Highland Statistics Ltd.).
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y, use = "pairwise.complete.obs"))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
Mypairs <- function(Z) {
MyVarx <- colnames(Z)
pairs(Z, labels = MyVarx,
cex.labels = 2,
lower.panel = function(x, y, digits=2, prefix="", cex.cor = 6) {
panel.cor(x, y, digits, prefix, cex.cor)},
upper.panel = function(x, y) points(x, y,
pch = 16, cex = 0.8,
col = gray(0.1)))
#print(P)
}
The data preparation steps have been removed from the Markdown file
for simplicity. fread() function provides a faster way to
read large data files, like otolith chemistry data.
Note: We measured a total of 11 elements during LA-ICP-MS. However, we focused on three main elements in our case study.
element_data <- fread("./element_example.csv")
head(element_data)
## FishID Batch distance ElapsedTime_s Ca43_CPS IntStdWv Li_Ca B_Ca
## <char> <char> <num> <num> <num> <num> <num> <num>
## 1: AA399 Batch1 0.000000 0.0000000 3194710 38.8 3.497322 6.379806
## 2: AA399 Batch1 1.174999 0.2349997 3275690 38.8 4.263692 4.664550
## 3: AA399 Batch1 2.350500 0.4700999 3215165 38.8 2.895865 6.339253
## 4: AA399 Batch1 3.525501 0.7051001 3224013 38.8 1.732612 6.321875
## 5: AA399 Batch1 4.700499 0.9400997 3277593 38.8 4.829469 6.218547
## 6: AA399 Batch1 5.875500 1.1751000 3076989 38.8 6.657520 6.623984
## Na_Ca Mg_Ca P_Ca Mn_Ca Cu_Ca Zn_Ca Sr_Ca Ba_Ca
## <num> <num> <num> <num> <num> <num> <num> <num>
## 1: 10854.02 185.2442 135.16952 0.7946818 0.4256509 1.7110894 237.1367 41.12080
## 2: 11025.58 172.7469 60.97281 0.4537337 0.4792490 1.1552539 225.5899 38.05616
## 3: 11951.86 169.0499 109.31789 0.4473945 0.5535919 1.5693848 229.0001 39.07932
## 4: 11089.47 165.0791 53.64245 0.9210075 0.2914868 1.6955393 252.5336 40.22666
## 5: 10984.70 163.1376 101.78624 0.4096791 0.2867209 1.2828876 227.9373 36.89037
## 6: 11861.33 175.4762 88.11345 0.9650105 0.4760581 0.6831584 251.1751 40.60971
## Pb_Ca Species capture_year capture_month TL SL
## <num> <char> <int> <int> <int> <int>
## 1: 0.004623842 Lutjanus malabaricus 2021 11 730 622
## 2: 0.004509533 Lutjanus malabaricus 2021 11 730 622
## 3: 0.000000000 Lutjanus malabaricus 2021 11 730 622
## 4: 0.000000000 Lutjanus malabaricus 2021 11 730 622
## 5: 0.000000000 Lutjanus malabaricus 2021 11 730 622
## 6: 0.000000000 Lutjanus malabaricus 2021 11 730 622
## WW Region
## <num> <char>
## 1: 4980 Riau Archipelago
## 2: 4980 Riau Archipelago
## 3: 4980 Riau Archipelago
## 4: 4980 Riau Archipelago
## 5: 4980 Riau Archipelago
## 6: 4980 Riau Archipelago
str(element_data)
## Classes 'data.table' and 'data.frame': 162128 obs. of 24 variables:
## $ FishID : chr "AA399" "AA399" "AA399" "AA399" ...
## $ Batch : chr "Batch1" "Batch1" "Batch1" "Batch1" ...
## $ distance : num 0 1.17 2.35 3.53 4.7 ...
## $ ElapsedTime_s: num 0 0.235 0.47 0.705 0.94 ...
## $ Ca43_CPS : num 3194710 3275690 3215165 3224013 3277593 ...
## $ IntStdWv : num 38.8 38.8 38.8 38.8 38.8 38.8 38.8 38.8 38.8 38.8 ...
## $ Li_Ca : num 3.5 4.26 2.9 1.73 4.83 ...
## $ B_Ca : num 6.38 4.66 6.34 6.32 6.22 ...
## $ Na_Ca : num 10854 11026 11952 11089 10985 ...
## $ Mg_Ca : num 185 173 169 165 163 ...
## $ P_Ca : num 135.2 61 109.3 53.6 101.8 ...
## $ Mn_Ca : num 0.795 0.454 0.447 0.921 0.41 ...
## $ Cu_Ca : num 0.426 0.479 0.554 0.291 0.287 ...
## $ Zn_Ca : num 1.71 1.16 1.57 1.7 1.28 ...
## $ Sr_Ca : num 237 226 229 253 228 ...
## $ Ba_Ca : num 41.1 38.1 39.1 40.2 36.9 ...
## $ Pb_Ca : num 0.00462 0.00451 0 0 0 ...
## $ Species : chr "Lutjanus malabaricus" "Lutjanus malabaricus" "Lutjanus malabaricus" "Lutjanus malabaricus" ...
## $ capture_year : int 2021 2021 2021 2021 2021 2021 2021 2021 2021 2021 ...
## $ capture_month: int 11 11 11 11 11 11 11 11 11 11 ...
## $ TL : int 730 730 730 730 730 730 730 730 730 730 ...
## $ SL : int 622 622 622 622 622 622 622 622 622 622 ...
## $ WW : num 4980 4980 4980 4980 4980 4980 4980 4980 4980 4980 ...
## $ Region : chr "Riau Archipelago" "Riau Archipelago" "Riau Archipelago" "Riau Archipelago" ...
## - attr(*, ".internal.selfref")=<externalptr>
growth_data <- fread("./growth_example.csv")
head(growth_data)
## FishID Age Year Species Cohort Region Width
## <char> <int> <int> <char> <int> <char> <num>
## 1: AA399 1 2017 Lutjanus malabaricus 2016 Riau Archipelago 1889.313
## 2: AA399 2 2018 Lutjanus malabaricus 2016 Riau Archipelago 496.183
## 3: AA399 3 2019 Lutjanus malabaricus 2016 Riau Archipelago 337.150
## 4: AA399 4 2020 Lutjanus malabaricus 2016 Riau Archipelago 248.092
## 5: AA399 5 2021 Lutjanus malabaricus 2016 Riau Archipelago 222.646
## 6: AA399 6 2021 Lutjanus malabaricus 2016 Riau Archipelago 184.478
## Core Edge formation_month
## <num> <num> <int>
## 1: 248.092 184.478 10
## 2: 248.092 184.478 10
## 3: 248.092 184.478 10
## 4: 248.092 184.478 10
## 5: 248.092 184.478 10
## 6: 248.092 184.478 10
str(growth_data)
## Classes 'data.table' and 'data.frame': 449 obs. of 10 variables:
## $ FishID : chr "AA399" "AA399" "AA399" "AA399" ...
## $ Age : int 1 2 3 4 5 6 1 2 3 4 ...
## $ Year : int 2017 2018 2019 2020 2021 2021 2017 2018 2019 2020 ...
## $ Species : chr "Lutjanus malabaricus" "Lutjanus malabaricus" "Lutjanus malabaricus" "Lutjanus malabaricus" ...
## $ Cohort : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ Region : chr "Riau Archipelago" "Riau Archipelago" "Riau Archipelago" "Riau Archipelago" ...
## $ Width : num 1889 496 337 248 223 ...
## $ Core : num 248 248 248 248 248 ...
## $ Edge : num 184 184 184 184 184 ...
## $ formation_month: int 10 10 10 10 10 10 10 10 10 10 ...
## - attr(*, ".internal.selfref")=<externalptr>
We often start our ablation transect outside the otolith core. The
following chunk removes ablation points (element data) outside the core.
As a result, this requires us to amend the distance column to start from
zero for each sample. This is achieved using the
mutate(across(distance ~ first(.))) function from the
dplyr package.
We will also create a cumulative sum distance column for aligning the element data with the otolith growth data.
core_data <- growth_data %>%
subset(select = c("FishID", "Core")) %>%
distinct(FishID, Core)
element_data <- merge(element_data, y = core_data[, c("FishID", "Core")], by = "FishID", all.x = TRUE)
element_data <- element_data %>%
group_by(FishID) %>%
filter(distance > Core) %>%
ungroup()
head(element_data, 5)
## # A tibble: 5 × 25
## FishID Batch distance ElapsedTime_s Ca43_CPS IntStdWv Li_Ca B_Ca Na_Ca Mg_Ca
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AA399 Batc… 249. 49.8 3224770 38.8 6.07 36.4 13104. 341.
## 2 AA399 Batc… 250. 50.1 3141809 38.8 5.63 17.9 13814. 355.
## 3 AA399 Batc… 251. 50.3 3157029 38.8 6.49 40.4 13820. 343.
## 4 AA399 Batc… 253. 50.5 3366692 38.8 7.47 28.8 12839. 320.
## 5 AA399 Batc… 254. 50.8 2905201 38.8 9.62 35.1 14776. 365.
## # ℹ 15 more variables: P_Ca <dbl>, Mn_Ca <dbl>, Cu_Ca <dbl>, Zn_Ca <dbl>,
## # Sr_Ca <dbl>, Ba_Ca <dbl>, Pb_Ca <dbl>, Species <chr>, capture_year <int>,
## # capture_month <int>, TL <int>, SL <int>, WW <dbl>, Region <chr>, Core <dbl>
## Amend the distance column to start from zero for each individual.
# c_distance represents the corrected distance.
element_data <- element_data %>%
group_by(FishID) %>%
mutate(across(distance, ~ . - first(.), .names = "c_distance")) %>%
ungroup()
## Create a cumulative sum distance column for the growth data
growth_data <- growth_data %>%
group_by(FishID) %>%
arrange(FishID, Age) %>%
mutate(c_distance = cumsum(Width))
growth_data$diff <- growth_data$c_distance
# Duplicate this for rolling join alignment
element_data$c_distance <- round(element_data$c_distance, 3)
The rolling join function from the data.table package
allows us to pinpoint the nearest ablation point that indicates a fully
formed annuli (opaque zone) for each individual. This assignment allows
us to perform linear interpolation between each otolith growth increment
to obtain a relative Age value.
## Create nearest rolling join variable in the data frame
setDT(element_data) ; setDT(growth_data)
element_data <- growth_data[,c("FishID", "c_distance", "Age", "Year", "Cohort", "diff")][element_data, roll = "nearest", on = .(FishID, c_distance)]
element_data$difference <- abs(element_data$c_distance - element_data$diff)
# The difference variable represents the distance from the first increment.
## Minimise the difference between the location of each Age and the ablation spot
element_data <- element_data %>%
group_by(FishID, Age) %>%
dplyr::mutate(Age = case_when(difference == min(difference) ~ Age))
element_data <- element_data %>% mutate(Year = case_when(Age != 'NA' ~ Year, TRUE ~ NA))
element_data <- as.data.frame(element_data)
We also noticed that we ablated a portion of the epoxy adjacent to the sulcus acusticus in one otolith section. This should not be included in our analyses and we replaced them with NA values.
## Replace areas with cracks or damage with NAs
element_data <- element_data %>% mutate(across(12:22, ~ if_else(distance >= 164.975 & distance <= 418.782 & FishID == "AB187", NA_real_, .)))
The following chunk uses the rollmedian() and
rollmean() functions from the zoo package to
remove outlier values and potential data fuzziness. Missing values were
also removed from the element data.
### Create rolling median and mean to remove outliers and potential data fuzziness ####
median_size <- 5
element_data <- element_data %>%
group_by(FishID) %>%
mutate(sLi_Ca = rollmedian(Li_Ca, k = median_size, align = "center", fill = NA),
sB_Ca = rollmedian(B_Ca, k = median_size, align = "center", fill = NA),
sNa_Ca = rollmedian(Na_Ca, k = median_size, align = "center", fill = NA),
sMg_Ca = rollmedian(Mg_Ca, k = median_size, align = "center", fill = NA),
sP_Ca = rollmedian(P_Ca, k = median_size, align = "center", fill = NA),
sMn_Ca = rollmedian(Mn_Ca, k = median_size, align = "center", fill = NA),
sCu_Ca = rollmedian(Cu_Ca, k = median_size, align = "center", fill = NA),
sZn_Ca = rollmedian(Zn_Ca, k = median_size, align = "center", fill = NA),
sSr_Ca = rollmedian(Sr_Ca, k = median_size, align = "center", fill = NA),
sBa_Ca = rollmedian(Ba_Ca, k = median_size, align = "center", fill = NA),
sPb_Ca = rollmedian(Pb_Ca, k = median_size, align = "center", fill = NA))
mean_size <- 10
element_data <- element_data %>%
group_by(FishID) %>%
mutate(sLi_Ca = rollmean(sLi_Ca, k = mean_size, align = "center", fill = NA),
sB_Ca = rollmean(sB_Ca, k = mean_size, align = "center", fill = NA),
sNa_Ca = rollmean(sNa_Ca, k = mean_size, align = "center", fill = NA),
sMg_Ca = rollmean(sMg_Ca, k = mean_size, align = "center", fill = NA),
sP_Ca = rollmean(sP_Ca, k = mean_size, align = "center", fill = NA),
sMn_Ca = rollmean(sMn_Ca, k = mean_size, align = "center", fill = NA),
sCu_Ca = rollmean(sCu_Ca, k = mean_size, align = "center", fill = NA),
sZn_Ca = rollmean(sZn_Ca, k = mean_size, align = "center", fill = NA),
sSr_Ca = rollmean(sSr_Ca, k = mean_size, align = "center", fill = NA),
sBa_Ca = rollmean(sBa_Ca, k = mean_size, align = "center", fill = NA),
sPb_Ca = rollmean(sPb_Ca, k = mean_size, align = "center", fill = NA))
element_data <- element_data %>% subset(select = -c(12:22))
element_data <- as.data.frame(element_data)
setcolorder(element_data, c('FishID','Batch','ElapsedTime_s','c_distance','Age','Year','diff','distance','Species','Region','Cohort','capture_year','capture_month','TL','SL','WW','Ca43_CPS','IntStdWv','Core'))
element_data <- element_data %>% filter(rowSums(across(21:31, ~ is.na(.x))) < 10)
The following chunk performs linear interpolation of Age using the
na.approx() function from the zoo package. To
do so, we will first need to set the minimum Age to zero and the minimum
Year to the Cohort year. c_Age represents a relative Age
measurement between each fully formed increment. A discrete and
continuous Year variable was also created in the chunk.
Cohort (birth year) may also be used as a fixed or
random effect in the additive modelling framework. We did not opt to do
so in this study due to unequal sample sizes across cohort classes. This
can be checked using summarise(n = n_distinct()) function
from the dplyr package.
We will also remove missing values from the relative Age
variable, as these represent ablation spots between the otolith edge and
the final completed annuli.
### Interpolating Age based on the missing NA values present ####
## Define the number of decimal places in both columns
element_data$Age <- round(element_data$Age, 10)
element_data$Year <- round(element_data$Year, 10)
## Replace the minimum Age and Cohort value
element_data <- element_data %>%
group_by(FishID) %>%
mutate(Age = ifelse(c_distance == min(c_distance), 0, Age)) %>%
mutate(Year = ifelse(c_distance == min(c_distance), min(Cohort), Year)) %>%
ungroup()
## Age interpolation
element_data <- element_data %>%
group_by(FishID) %>%
mutate(c_Age = zoo::na.approx(Age, na.rm = FALSE)) %>%
ungroup()
## Year interpolation
element_data <- element_data %>%
group_by(FishID) %>%
mutate(c_Year = zoo::na.approx(Year, na.rm = FALSE)) %>%
ungroup()
## Year fill for each increment
element_data <- element_data %>%
group_by(FishID) %>%
fill(Year, .direction = "downup") %>%
ungroup()
colSums(is.na(element_data))
## FishID Batch ElapsedTime_s c_distance Age
## 0 0 0 0 148863
## Year diff distance Species Region
## 0 0 0 0 0
## Cohort capture_year capture_month TL SL
## 0 0 0 2169 0
## WW Ca43_CPS IntStdWv Core difference
## 0 0 0 0 0
## sLi_Ca sB_Ca sNa_Ca sMg_Ca sP_Ca
## 0 0 0 0 0
## sMn_Ca sCu_Ca sZn_Ca sSr_Ca sBa_Ca
## 0 0 0 0 0
## sPb_Ca c_Age c_Year
## 0 3669 3669
element_data <- element_data %>% drop_na(c_Age, c_Year)
element_data <- element_data %>%
subset(select = -c(Age, diff, distance)) %>%
dplyr::rename(Age = c_Age)
setcolorder(element_data, c('FishID','Batch','ElapsedTime_s','c_distance','Age','Year','c_Year','Cohort','Species','Region','TL','SL','WW','capture_year','capture_month','Ca43_CPS','IntStdWv','Core'))
head(element_data, 5)
## # A tibble: 5 × 30
## FishID Batch ElapsedTime_s c_distance Age Year c_Year Cohort Species
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <chr>
## 1 AA399 Batch1 51.2 7.05 0 2016 2016 2016 Lutjanus …
## 2 AA399 Batch1 51.5 8.22 0.000624 2016 2016. 2016 Lutjanus …
## 3 AA399 Batch1 51.7 9.4 0.00125 2016 2016. 2016 Lutjanus …
## 4 AA399 Batch1 51.9 10.6 0.00187 2016 2016. 2016 Lutjanus …
## 5 AA399 Batch1 52.2 11.8 0.00250 2016 2016. 2016 Lutjanus …
## # ℹ 21 more variables: Region <chr>, TL <int>, SL <int>, WW <dbl>,
## # capture_year <int>, capture_month <int>, Ca43_CPS <dbl>, IntStdWv <dbl>,
## # Core <dbl>, difference <dbl>, sLi_Ca <dbl>, sB_Ca <dbl>, sNa_Ca <dbl>,
## # sMg_Ca <dbl>, sP_Ca <dbl>, sMn_Ca <dbl>, sCu_Ca <dbl>, sZn_Ca <dbl>,
## # sSr_Ca <dbl>, sBa_Ca <dbl>, sPb_Ca <dbl>
Important: Categorical variables must be coded as factors in the
mgcv package.
Zeros do not usually represent a true absence of trace elements in the otolith crystalline structure, but measurements below the detection limit. We replaced the zeros with an infinitesimally small value, 1e-10.
We further truncated the growth data due to limited sample sizes past a certain Age or Year.
element_data[c('FishID','Species','Region')] <- lapply(element_data[c('FishID','Species','Region')], as.factor)
element_data[20:30][element_data[20:30] == 0] <- 0.0000000001
## Truncate data based on the sample depth per Year and Region class bins
growth_data %>%
group_by(Region) %>%
summarise(n = n())
## # A tibble: 2 × 2
## Region n
## <chr> <int>
## 1 Riau Archipelago 199
## 2 Sorong 250
element_data <- element_data %>% filter(!((Region == "Riau Archipelago" & Year <= "2016") | (Region == "Sorong" & Year <= "2006")))
element_data <- element_data %>% filter(Age <= 12)
Note: The inclusion of pch = '.' in
geom_point() allows for quicker rendering of points in the
ggplot2 package.
element_data %>%
ggplot(aes(x = c_distance, y = sSr_Ca, colour = FishID, group = FishID))+
geom_point(size = 0.1, alpha = 0.1, pch = '.')+
geom_line(linewidth = 0.1, alpha = 0.1)+
facet_wrap(Species ~ Region)+
theme_bw()+
theme(legend.position = "none")
There is no collinearity between Age and Year.
Mypairs(element_data[,c("Age", "Year")])
There are clear non-linear relationships between the response variable and the continuous covariates, Age and Year. This suggests that we should use Generalised Additive Models.
element_data %>%
ggplot(aes(y = sSr_Ca, x = Age, colour = Region)) +
geom_point(pch = '.') +
ggtitle("Relationship between Sr and c_Age") +
geom_smooth(aes(group = Region), method = 'gam', colour = 'black') +
theme_bw()
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
element_data %>%
ggplot(aes(y = sSr_Ca, x = Year, colour = Region)) +
geom_point(pch = '.') +
ggtitle("Relationship between Sr and Year") +
geom_smooth(aes(group = Region), method = 'gam', colour = 'black') +
theme_bw()
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
The box plots indicate that there is an individual effect. We need to account for this individual-specific variation in element data using random effects, in addition to the non-linear relationship between response and predictor variables.
element_data %>%
ggplot(aes(y = sSr_Ca, x = FishID)) +
geom_boxplot(alpha = 0.5) +
geom_hline(yintercept = mean(element_data$sSr_Ca), colour = 'red', linewidth = 0.5, linetype = 'dashed') +
theme_bw() +
theme(axis.text.x = element_blank())
The aim of this study is to investigate the variables influencing
changes in otolith element composition (E:Ca) of tropical snapper
populations across the Indo-Pacific region. Generalised Additive Mixed
Models (GAMMs) are implemented via the mgcv package in R to
partition variance in the response variable by Region and Age.
Non-Linear Relationships: Exploratory plots
indicate clear non-linear relationships between E:Ca and the continuous
covariates. This suggests that the standard linear models may not be
appropriate for the data. Smoothers, typically specified using the
default bs = 'tp' function in mgcv, allow for
the modelling of non-linear effects. The k parameter refers
to the maximum allowable degrees of freedom for the smoother,
controlling the flexibility of the model. Although setting a high
k can better capture the complexity of the data, it comes
at a computational cost. It is important to note, as Alain Zuur and
Gavin Simpson mentioned, that it can be difficult to interpret highly
complex wiggly patterns when k > 10. Therefore,
selecting an appropriate k is a critical balance between
model accuracy and interpretability.
Distribution and Link Function: The models use a
Gamma distribution with a log-link function
(family = 'Gamma(link = "log")'), which is appropriate
given that the response variable consists of continuous positive values.
Zero values are replaced with a very small value (i.e., 1e-10), which
represents values below the detection limit, rather than a true absence
of data.
Random effects: Random effects are specified
using the bs = 're' function, which accounts for the
individual-specific variation in data. This allows the model to better
account for group-level differences in E:Ca values.
Computational efficiency: Due to the large size
of the data set (around 120,000 observations), the bam()
function was used instead of gam() to reduce computation
time. gam() is recommended for smaller datasets, where
precision is key, and bam() for larger datasets to improve
computational efficiency by using approximations and faster algorithms.
Additionally, the select = TRUE option penalises functions
in the null space of the penalty matrices for each smooth term (Pedersen
et al., 2019).
Penalties and Derivatives: A penalty can also be
set for the squared first derivative of the function, by specifying
m = 1, rather than the second derivative, which can help to
reduce collinearity between the global smoother and the group-specific
terms. This reduces uncertainty around the global smoother, thereby
improving model stability (Pedersen et al., 2019).
For further details, see Pedersen et al (2019), Hierarchical generalized additive models in ecology: an introduction with mgcv and Gavin Simpson’s Stack Exchange discussion (Last accessed: 27 February 2025).
HA: There are significant non-linear effects of Age, random effects of FishID (individual), and the interaction of FishID and Age, on otolith Sr.
Thin plate regression splines are specified using
bs = 'tp', although other
spline functions can be also be used (e.g., cubic regression spline,
P-splines). The k parameter was set to 24; a biologically
informed decision that is double the maximum truncated age of 12 to
account for intra-annual variation in otolith element patterns.
bs = 're' was included in the model to account for
individual-specific variation in otolith element patterns.
The inclusion of s(FishID, bs = 're') creates random
intercepts (deviations from the overall mean, the model constant term)
for each level of the factors. In linear mixed-effects model syntax
(lme4), this would be equivalent to
(1 | FishID).
s(FishID, Age, bs = 're') adjusts the slope of the
trend of a numeric predictor for each level of the factor. This allows
the effect of Age to vary for each fish. In linear mixed-effects model
syntax (lme4), this would be equivalent to
(Age | FishID).
select = TRUE penalises the linear components of each
smoother term. method = 'fREML' produces fast Restricted
Maximum Likelihood Estimations.
family = 'Gamma(link = "log")' calls the Gamma distribution
in mgcv with a log-link function. m = 2 sets a
penalty for the squared second derivative of the smooth function.
M1a <- bam(sSr_Ca ~
s(Age, k = 24, bs = "tp", m = 2) +
s(FishID, bs = 're') +
s(FishID, Age, bs = 're'),
select = TRUE,
method = "fREML",
family = Gamma(link = "log"),
data = element_data)
The numerical output of M1a indicates that there is a significant relationship between Sr:Ca and Age. There are no clear patterns in the residual distribution.
However, we should also check whether the model is properly specified
using the k.check() and gam.check() functions.
The former performs a basic dimension check, while the latter produces
residual plots in addition to calling on k.check(). The
p-value in gam.check() is computed by simulation: the
residuals are randomly re-shuffled k.rep times to obtain
the null distribution of the differencing variance estimator, provided
there is no pattern in the residuals.
We are primarily interested to know whether the model is wiggly enough. If the effective degrees of freedom (edf) is close to k, it means that the trade-off between smoothness and wiggliness is not balanced.
## Check numerical output
summary(M1a)
##
## Family: Gamma
## Link function: log
##
## Formula:
## sSr_Ca ~ s(Age, k = 24, bs = "tp", m = 2) + s(FishID, bs = "re") +
## s(FishID, Age, bs = "re")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.04228 0.02582 195.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Age) 21.87 22 3.841e+09 <2e-16 ***
## s(FishID) 64.76 65 1.170e+08 <2e-16 ***
## s(FishID,Age) 65.88 66 6.358e+06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.838 Deviance explained = 83.1%
## fREML = -1.1176e+05 Scale est. = 0.0095842 n = 124254
## Check residual distribution
plot(fitted(M1a), residuals(M1a))
## Check diagnostics
set.seed(100) ; k.check(M1a)
## k' edf k-index p-value
## s(Age) 23 21.86529 1.037316 1
## s(FishID) 66 64.76011 NA NA
## s(FishID,Age) 66 65.87932 NA NA
set.seed(100) ; gam.check(M1a)
##
## Method: fREML Optimizer: perf newton
## full convergence after 10 iterations.
## Gradient range [-0.001357912,0.0007388702]
## (score -111763.7 & scale 0.009584154).
## eigenvalue range [-0.0001402849,62126.54].
## Model rank = 156 / 156
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(Age) 23.0 21.9 1.01 0.86
## s(FishID) 66.0 64.8 NA NA
## s(FishID,Age) 66.0 65.9 NA NA
## Preliminary plots
plot.gam(M1a, all.terms = T)
Note: The plot.gam() function plots the component smooth
functions on the scale of the linear predictor.
p <- marginaleffects::plot_predictions(M1a, condition = c("Age"), type = 'response')
p
HA: There are significant non-linear effects of Age, random effects of Region, FishID (individual), and the interaction of FishID and Age, on otolith Sr.
M1b <- bam(sSr_Ca ~
s(Age, k = 24, bs = "tp", m = 2) +
s(Region, bs = 're') +
s(FishID, bs = 're') +
s(FishID, Age, bs = 're'),
select = TRUE,
method = "fREML",
family = Gamma(link = "log"),
data = element_data)
HA: There are significant non-linear effects of Age, Region-level Age smooth functions with identical wiggliness, random effects of FishID (individual), and the interaction of FishID and Age, on otolith Sr.
The GS model (sensu Pedersen et al., 2019) tests the amount of variation in E:Ca that is explained by the shared Age smoother, plus fully penalised (2nd order derivative penalty) group-level Region smooth functions of Age. A fully penalised factor smooth means that there are no functions for which the penalty has no effect.
As Gavin Simpson describes in the following StackExchange links, see
link
1 and link
2 (Last accessed: 27 February 2025), the family of smooths
(bs = "fs") function created in this basis share the same
smoothing parameter and have approximately the same wiggliness. This can
be thought of as being draws from a Gaussian distribution centred on 0
with some standard deviation (which is inversely proportional to the
smoothing parameter for the group means or the group slopes [linear
functions])“.
M1c <- bam(sSr_Ca ~
s(Age, k = 24, bs = "tp", m = 2) +
s(Age, Region, k = 24, bs = "fs", m = 2) +
s(FishID, bs = 're') +
s(FishID, Age, bs = 're'),
select = TRUE,
method = "fREML",
family = Gamma(link = "log"),
data = element_data)
HA: There are significant non-linear effects of Age, Region-level Age smooth functions with different wiggliness, random effects of Region, FishID (individual), and the interaction of FishID and Age, on otolith Sr.
The GI model tests the amount of variation in E:Ca that is explained
by the shared Age smoother, plus group-level Region smooth functions of
Age. Unlike bs = "fs", the factor by smooth
variant creates a separate smoothing parameter for each level of Region.
This means that one smoothing parameter may be wiggly, while the other
may produce a smooth pattern.
Region-level random intercepts (or factor) must be specified in the
model, as they are by default not incorporated in the by =
variable smooth term.
M1d <- bam(sSr_Ca ~
s(Age, k = 24, bs = "tp", m = 2) +
s(Age, by = Region, k = 24, bs = "tp", m = 2) +
s(Region, bs = 're') +
s(FishID, bs = 're') +
s(FishID, Age, bs = 're'),
select = TRUE,
method = "fREML",
family = Gamma(link = "log"),
data = element_data)
Important note:
The choice of m alters the penalty to smooth terms and changes how much your GAM restricts wiggliness.
Pedersen et al (2019) penalises the
squared first derivative of the s(Age, by = Region) term
using m = 1 to reduce collinearity between the global
smoother and group-specific terms, which has been reported to produce
uncertainty around the shared smooth term. However, we noticed that
setting a penalised squared first derivative m = 1 can
sometimes create a smooth function that overfits the data when there are
a sparse number of data points (i.e., fewer otolith E:Ca
observations in older fish).
Dr Shawn Hemelstrand describes in this StackExchange link (Last accessed: 27 February 2025) that you should try to fit your data to the functional relationship that is best defined by the theory and data driving that function, and try when possible to not overfit the data, regardless of the selection of m.
HA: There are significant non-linear effects of Region-level Age smooth functions with identical wiggliness, random effects of FishID (individual), and the interaction of FishID and Age, on otolith Sr.
One may not be particularly interested in the effects of individual
Regions, which is assessed using the by smooth. Instead, we
should think of this function as a single Age smoother superimposed
across Regions to estimate the variation within and between groups.
M1e <- bam(sSr_Ca ~
s(Age, Region, k = 24, bs = "fs", m = 2) +
s(FishID, bs = 're') +
s(FishID, Age, bs = 're'),
select = TRUE,
method = "fREML",
family = Gamma(link = "log"),
data = element_data)
HA: There are significant non-linear effects of Region-level Age smooth functions with different wiggliness, random effects of Region, FishID (individual), and the interaction of FishID and Age, on otolith Sr.
Unlike bs = "fs", the by smooth variant
creates a separate smoothing parameter for each level of Region. We are
also assuming there is no shared (phylogenetic) effect present.
M1f <- bam(sSr_Ca ~
s(Age, by = Region, k = 24, bs = "tp", m = 2) +
s(Region, bs = 're') +
s(FishID, bs = 're') +
s(FishID, Age, bs = 're'),
select = TRUE,
method = "fREML",
family = Gamma(link = "log"),
data = element_data)
The following chunk performs model selection using Akaike’s Information Criterion (AIC) to compare our additive models. Gavin Simpson describes in the following Stack Exchange link that Akaike’s Information Criterion for small sample sizes (AICc) should not be used for GAMs.
Here, we used the AICtab function from the
bbmle package to compare additive models. The optimal model
structure is M1d (GI model).
bbmle::AICtab(M1a, M1b, M1c, M1d, M1e, M1f, base = T, logLik = T, weights = T)
## logLik AIC dLogLik dAIC df weight
## M1d -570536.4 1141423.1 1149.5 0.0 175.1 0.5484
## M1c -570536.7 1141424.8 1149.3 1.6 175.7 0.2422
## M1e -570536.5 1141425.1 1149.5 2.0 176.1 0.2063
## M1f -570541.4 1141433.5 1144.5 10.3 175.3 0.0031
## M1b -571652.9 1143615.4 33.0 2192.3 154.8 <0.001
## M1a -571685.9 1143681.5 0.0 2258.4 154.8 <0.001
## Optimal model structure is M1d.
summary(M1d)
##
## Family: Gamma
## Link function: log
##
## Formula:
## sSr_Ca ~ s(Age, k = 24, bs = "tp", m = 2) + s(Age, by = Region,
## k = 24, bs = "tp", m = 2) + s(Region, bs = "re") + s(FishID,
## bs = "re") + s(FishID, Age, bs = "re")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5351 0.0218 253.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Age) 2.016e+01 23 4.645e+01 <2e-16 ***
## s(Age):RegionRiau Archipelago 1.756e+01 23 3.660e+00 <2e-16 ***
## s(Age):RegionSorong 6.548e+00 23 4.030e-01 <2e-16 ***
## s(Region) 5.458e-06 1 0.000e+00 0.0104 *
## s(FishID) 6.444e+01 65 2.410e+06 <2e-16 ***
## s(FishID,Age) 6.408e+01 66 3.776e+06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.84 Deviance explained = 83.4%
## fREML = -1.1299e+05 Scale est. = 0.0094024 n = 124254
The following chunk fits a discrete Year slope and Region random intercept to the optimal demographic model to investigate the presence of temporal patterns in otolith E:Ca. Model selection using AIC was then used to identify the optimal model structure to otolith chemistry data.
Model selection indicates that the null model without temporal effects provides a marginally better fit for otolith Sr:Ca.
M2 <- bam(sSr_Ca ~
s(Age, k = 24, bs = "tp", m = 2) +
s(Age, by = Region, k = 24, bs = "tp", m = 2) +
s(Region, bs = 're') +
s(Region, Year, bs = 're') +
s(FishID, bs = 're') +
s(FishID, Age, bs = 're'),
select = TRUE,
method = "fREML",
family = Gamma(link = "log"),
data = element_data)
bbmle::AICtab(M1d, M2, base = T, logLik = T, weights = T)
## logLik AIC dLogLik dAIC df weight
## M1d -570536.4 1141423.1 44.0 0.0 175.1 1
## M2 -570580.4 1141511.1 0.0 88.0 175.1 <0.001
For simplicity, we will refit the optimal model as
Lm_Sr. The following chunk checks the numerical output,
residual plot, diagnostic information, and plots the component smooth
functions on the scale of the linear predictor.
The plot_predictions() function from the
marginaleffects package is used to plot the effects at the
scale of the response variable, while setting the group-specific random
effects to zero.
## Construct the optimal model for Sr
Lm_Sr <- M1d
## Check numerical output
summary(Lm_Sr)
##
## Family: Gamma
## Link function: log
##
## Formula:
## sSr_Ca ~ s(Age, k = 24, bs = "tp", m = 2) + s(Age, by = Region,
## k = 24, bs = "tp", m = 2) + s(Region, bs = "re") + s(FishID,
## bs = "re") + s(FishID, Age, bs = "re")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5351 0.0218 253.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Age) 2.016e+01 23 4.645e+01 <2e-16 ***
## s(Age):RegionRiau Archipelago 1.756e+01 23 3.660e+00 <2e-16 ***
## s(Age):RegionSorong 6.548e+00 23 4.030e-01 <2e-16 ***
## s(Region) 5.458e-06 1 0.000e+00 0.0104 *
## s(FishID) 6.444e+01 65 2.410e+06 <2e-16 ***
## s(FishID,Age) 6.408e+01 66 3.776e+06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.84 Deviance explained = 83.4%
## fREML = -1.1299e+05 Scale est. = 0.0094024 n = 124254
anova(Lm_Sr)
##
## Family: Gamma
## Link function: log
##
## Formula:
## sSr_Ca ~ s(Age, k = 24, bs = "tp", m = 2) + s(Age, by = Region,
## k = 24, bs = "tp", m = 2) + s(Region, bs = "re") + s(FishID,
## bs = "re") + s(FishID, Age, bs = "re")
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Age) 2.016e+01 2.300e+01 4.645e+01 <2e-16
## s(Age):RegionRiau Archipelago 1.756e+01 2.300e+01 3.660e+00 <2e-16
## s(Age):RegionSorong 6.548e+00 2.300e+01 4.030e-01 <2e-16
## s(Region) 5.458e-06 1.000e+00 0.000e+00 0.0104
## s(FishID) 6.444e+01 6.500e+01 2.410e+06 <2e-16
## s(FishID,Age) 6.408e+01 6.600e+01 3.776e+06 <2e-16
## Check residual plot
plot(fitted(Lm_Sr), residuals(Lm_Sr))
## Check diagnostic information for our optimal model
set.seed(100) ; k.check(Lm_Sr)
## k' edf k-index p-value
## s(Age) 23 2.016175e+01 1.03012 0.9825
## s(Age):RegionRiau Archipelago 23 1.755876e+01 1.03012 0.9925
## s(Age):RegionSorong 23 6.548451e+00 1.03012 0.9975
## s(Region) 2 5.458072e-06 NA NA
## s(FishID) 66 6.443964e+01 NA NA
## s(FishID,Age) 66 6.408081e+01 NA NA
set.seed(100) ; gam.check(Lm_Sr)
##
## Method: fREML Optimizer: perf newton
## full convergence after 30 iterations.
## Gradient range [-0.0004977498,0.0001388315]
## (score -112988.7 & scale 0.009402378).
## eigenvalue range [-1.516234e-05,62126.54].
## Model rank = 204 / 204
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(Age) 2.30e+01 2.02e+01 1.01 0.83
## s(Age):RegionRiau Archipelago 2.30e+01 1.76e+01 1.01 0.78
## s(Age):RegionSorong 2.30e+01 6.55e+00 1.01 0.77
## s(Region) 2.00e+00 5.46e-06 NA NA
## s(FishID) 6.60e+01 6.44e+01 NA NA
## s(FishID,Age) 6.60e+01 6.41e+01 NA NA
## Generate preliminary plots
plot.gam(Lm_Sr, all.terms = T)
## Plot effects on response scale
p <- marginaleffects::plot_predictions(Lm_Sr, condition = c("Age","Region"), type = 'response') + theme_bw()
p
The following chunk assesses the numerical output from the optimal GAMM. The r parameter and sigma of the random effects are extracted from the fitted gamma GAMM. We will also extract and plot the Pearson residuals against the gamma fitted values to assess if there are any underlying patterns.
The code is adapted from the Introduction to GAM and GAMM course material by Highland Statistics Ltd.
## Extract the r parameter from the fitted gamma GAMM model
# See scale estimated from `summary()`
summary(Lm_Sr)
##
## Family: Gamma
## Link function: log
##
## Formula:
## sSr_Ca ~ s(Age, k = 24, bs = "tp", m = 2) + s(Age, by = Region,
## k = 24, bs = "tp", m = 2) + s(Region, bs = "re") + s(FishID,
## bs = "re") + s(FishID, Age, bs = "re")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5351 0.0218 253.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Age) 2.016e+01 23 4.645e+01 <2e-16 ***
## s(Age):RegionRiau Archipelago 1.756e+01 23 3.660e+00 <2e-16 ***
## s(Age):RegionSorong 6.548e+00 23 4.030e-01 <2e-16 ***
## s(Region) 5.458e-06 1 0.000e+00 0.0104 *
## s(FishID) 6.444e+01 65 2.410e+06 <2e-16 ***
## s(FishID,Age) 6.408e+01 66 3.776e+06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.84 Deviance explained = 83.4%
## fREML = -1.1299e+05 Scale est. = 0.0094024 n = 124254
r <- 1 / 0.0094024
r
## [1] 106.3558
# sSr_Ca ~ Gamma(mu_ijk, r = 106.3558)
# E(sSr_Ca) = mu_ijk
# var(sSr_Ca) = mu_ijk ^2 / 106.3558
## Attain the sigma of the random effect of FishID
gam.vcomp(Lm_Sr)
##
## Standard deviations and 0.95 confidence intervals:
##
## std.dev lower upper
## s(Age)1 3.124660e-01 2.165408e-01 4.508849e-01
## s(Age)2 5.960469e-01 1.485625e-01 2.391397e+00
## s(Age):RegionRiau Archipelago1 2.579805e-01 1.651198e-01 4.030644e-01
## s(Age):RegionRiau Archipelago2 7.162964e-05 0.000000e+00 Inf
## s(Age):RegionSorong1 1.317815e-01 3.754231e-02 4.625813e-01
## s(Age):RegionSorong2 4.992506e-03 4.095676e-22 6.085715e+16
## s(Region) 7.185622e-05 7.157404e-05 7.213950e-05
## s(FishID) 1.519421e-01 1.262983e-01 1.827927e-01
## s(FishID,Age) 4.719586e-02 3.954391e-02 5.632849e-02
## scale 9.696586e-02 9.658509e-02 9.734814e-02
##
## Rank: 9/10
## Perform model validation for the Gamma GAMM
element_data$E <- resid(Lm_Sr, type = "pearson")
# Pearson residual values
element_data$mu <- fitted(Lm_Sr)
# Gamma fitted values
## Plot residuals versus fitted values
element_data %>%
filter(Age < 12) %>%
ggplot(aes(y = E, x = mu)) +
geom_point(size = 0.5, alpha = 0.3) +
xlab("Fitted values") + ylab("Residuals") +
theme_bw()
# Plot observed versus fitted values
element_data %>%
ggplot(aes(y = sSr_Ca, x = mu)) +
geom_point(size = 0.5, alpha = 0.5) +
xlab("Fitted values") + ylab("sSr_Ca") +
theme_bw()
## Plot residuals against Age data
element_data %>%
ggplot(aes(y = E, x = Age, colour = Region)) +
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(colour = 'black') +
facet_wrap(.~Region) +
xlab("Age") + ylab("Residuals") +
theme_bw()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
There are no clear underlying patterns in the otolith chemistry data.
The following chunk simulates residuals from the fitted model to assess for uniformity and check for outliers in the data. While the plots seem to indicate the presence of outliers, non-uniformity and dispersion in the data, visual observations suggest that this is okay.
According to the package vignette for DHARMa, Florian Hartig states in the “General remarks on interperting residual patterns and tests” section of the following link that random effect estimates are excluded in the predictions when plotting residual versus predicted data. He adds that DHARMa residuals often show a slight pattern in the residuals even if the model is correctly specified, and tests for this can get significant for large sample sizes.
Lm_Sr_Residuals <- simulateResiduals(fittedModel = Lm_Sr, plot = FALSE)
par(mfrow = c(1,1))
plotQQunif(Lm_Sr_Residuals,
testUniformity = TRUE,
testOutliers = TRUE,
testDispersion = TRUE)
DHARMa::plotResiduals(Lm_Sr_Residuals, quantreg = TRUE, smoothScatter = TRUE)
plotResiduals(Lm_Sr_Residuals, form = element_data$Age, smoothScatter = TRUE)
The following chunk extracts the random effects and checks whether they are independent and normally distributed.
According to the course notes from Highland Statistics Ltd., the idea
is to specify a data frame with only unique values for
s(FishID), with everything else set to a constant value.
The predict() function is then used to predict values of
the random effects and their corresponding standard errors. The
estimated random effects and their standard errors are used to check for
dependency among the random effects.
coef(Lm_Sr)
data2 <- data.frame(FishID = levels(element_data$FishID))
data2$Age <- mean(element_data$Age)
data2$Region <- "Riau Archipelago"
## Predict values of random effects and their SE,
p.Lm_Sr <- predict(Lm_Sr, data2,
type = "terms", se.fit = TRUE)
## Extract the estimated random effects and their standard errors
data2$ai <- p.Lm_Sr[["fit"]][ , "s(FishID)"]
data2$ai.se <- p.Lm_Sr[["se.fit"]][ , "s(FishID)"]
## Plot in a dotchart
data2$rowID <- 1:nrow(data2)
p <- data2 %>%
ggplot(aes(x = ai, y =rowID)) +
geom_point(size = 1, col = grey(0.5)) +
geom_errorbarh(aes(y = rowID, xmax = ai + 1.96 * ai.se, xmin = ai - 1.96 * ai.se), col = "blue", height = 0.05, alpha = 0.2) +
geom_vline(xintercept = 0, lty = 2, col = "red") +
xlab("Estimated random effect") + ylab("FishID") +
scale_x_continuous(limits = c(-0.75, 0.75), breaks = c(-0.75, -0.50, -0.25, 0.00, 0.25, 0.50, 0.75), labels = c(-0.75, -0.50, -0.25, 0.00, 0.25, 0.50, 0.75)) +
theme_bw()
p
There is no clear pattern in this dot chart, which is good, as we assume that the random effects are independent.
The plot_smooth function from the itsadug
package plots the fitted effects from the optimal additive model. The
argument transform accepts a function for transforming the
fitted values into the original scale. The argument
rm.ranef cancels the contribution of random effects.
We will then predict how Sr varies with Age for each Region-level after removing the effect of random effects, which is specified as an NA value in the “Lm_Sr_Age_Data” data frame. The code is adapted from the Introduction to GAM and GAMM course material by Highland Statistics Ltd.
Note: Each smooth term should be excluded in the same way it was
labelled in summary(), not during model construction. This
can be checked using sapply(Lm_Sr$smooth,"[[","label").
### Preliminary plot
par(mfrow = c(1, 1))
itsadug::plot_smooth(Lm_Sr, view = "Age", plot_all = c("Region"), rm.ranef = FALSE)
## Summary:
## * Age : numeric predictor; with 30 values ranging from 0.000000 to 12.000000.
## * Region : factor; set to the value(s): Riau Archipelago, Sorong.
## * FishID : factor; set to the value(s): AB607.
# Plot without random effects
itsadug::plot_smooth(Lm_Sr, view = "Age", plot_all = c("Region"), transform = exp, rm.ranef = FALSE)
## Summary:
## * Age : numeric predictor; with 30 values ranging from 0.000000 to 12.000000.
## * Region : factor; set to the value(s): Riau Archipelago, Sorong.
## * FishID : factor; set to the value(s): AB607.
# Transform to original scale
## Plot changes in Sr across Age for each Region.
Lm_Sr_Age_Data <- ddply(element_data, .(Region), summarize,
Age = seq(min(Age), max(Age), length = 1000))
Lm_Sr_Age_Data$FishID <- "NA"
Lm_Sr_Age_Data$FishID <- factor(Lm_Sr_Age_Data$FishID)
Ignore the NA warning.
Note: You will need to calculate average Year value and include it in the predicted model if the temporal term shows up as important.
## Predict how Sr varies with Age for each Region after removing random effects
sapply(Lm_Sr$smooth,"[[","label")
## [1] "s(Age)" "s(Age):RegionRiau Archipelago"
## [3] "s(Age):RegionSorong" "s(Region)"
## [5] "s(FishID)" "s(FishID,Age)"
Lm_Sr_PredictAge <- predict.bam(Lm_Sr,
newdata = Lm_Sr_Age_Data,
exclude = c("s(Region)",
"s(FishID)",
"s(FishID,Age)"),
se.fit = TRUE)
## Warning in predict.gam(object, newdata = newdata, type = type, se.fit = se.fit,
## : factor levels NA not in original fit
Lm_Sr_Age_Data$mu <- exp(Lm_Sr_PredictAge$fit)
Lm_Sr_Age_Data$ul <- exp(Lm_Sr_PredictAge$fit + 1.96 * Lm_Sr_PredictAge$se.fit )
Lm_Sr_Age_Data$ll <- exp(Lm_Sr_PredictAge$fit - 1.96 * Lm_Sr_PredictAge$se.fit )
Lm_Sr_Age_Data <- Lm_Sr_Age_Data %>% filter(Age <= 12)
The following code plots the predicted marginal effect of the demographic effects on Sr, while setting all random effects to a constant value.
### Plot marginal effect of Age:Region on Sr:Ca, setting all random effect values to a constant value.
plot1 <- Lm_Sr_Age_Data %>%
ggplot(aes(x = Age, y = mu)) +
geom_line(aes(colour = Region)) +
geom_ribbon(aes(ymax = ul, ymin = ll, fill = Region), alpha = 0.5) +
scale_color_manual(values=c("#E69F00","#0072B2")) +
scale_fill_manual(values=c("#E69F00","#0072B2")) +
labs(y = 'Predicted Sr:Ca (mmol/mol)') +
scale_x_continuous(limits = c(0, 12), breaks = c(0,3,6,9,12), labels = c("0","3","6","9","12")) +
scale_y_continuous(limits = c(180, 800), breaks = c(200, 400, 600, 800), labels = c("200", "400", "600", "800")) +
theme(axis.text = element_text(size=16),
axis.title = element_text(size=20),
axis.line = element_line(colour = "grey15", linewidth=0.3),
legend.title=element_text(size=16),
legend.text=element_text(size=14),
legend.background = element_rect(fill='white',linewidth=0.5,linetype="solid",colour ="grey15"),
panel.border = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line(colour = "grey85", linewidth=0.15),
panel.grid.minor = element_blank(),
plot.margin = margin(1.0, 1.0, 1.0, 1.0, "cm"))
plot1
## Save file as .png for publication
#ggsave("plot1.png", plot = plot1, dpi = 300, width = 297, height = 210, units = "mm")
For any code-related questions, please contact Clement Ng at NG0002NT@e.ntu.edu.sg.
Thank you.